 tabular data




QATCH: Benchmarking SQL-centric tasks with Table Representation Learning Models on Your Data

Neural Information Processing Systems

Table Representation Learning (TRL) models are commonly pre-trained on large open-domain datasets comprising millions of tables and then used to address downstream tasks. Choosing the right TRL model to use on proprietary data can be challenging, as the best results depend on the content domain, schema, and data quality. Our purpose is to support end-users in testing TRL models on proprietary data in two established SQL-centric tasks, i.e., Question Answering (QA) and Semantic Parsing (SP). We present QATCH (Query-Aided TRL Checklist), a toolbox to highlight TRL models' strengths and weaknesses on relational tables unseen at training time. For an input table, QATCH automatically generates a testing checklist tailored to QA and SP. Checklist generation is driven by a SQL query engine that crafts tests of different complexity. This design facilitates inherent portability, allowing the checks to be used by alternative models. We also introduce a set of cross-task performance metrics evaluating the TRL model's performance over its output. Finally, we show how QATCH automatically generates tests for proprietary datasets to evaluate various state-of-the-art models including TAPAS, TAPEX, and ChatGPT.
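To make the idea of a query-driven checklist concrete, here is a minimal sketch that derives simple SQL tests from a table and computes the expected result of each by running it. It is not QATCH's actual API: the function name `generate_checklist`, the three test categories, and the `sales` table are assumptions chosen for illustration, using only Python's standard `sqlite3` module.

```python
import sqlite3


def generate_checklist(conn, table):
    """Return (category, sql, expected_rows) tests for one table (hypothetical helper)."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    queries = []
    for col in columns:
        # Three illustrative categories of increasing complexity.
        queries.append(("project", f'SELECT "{col}" FROM "{table}"'))
        queries.append(("distinct", f'SELECT DISTINCT "{col}" FROM "{table}"'))
        queries.append(("orderby", f'SELECT * FROM "{table}" ORDER BY "{col}" ASC'))
    # Executing each query on the table itself provides the expected answer.
    return [(cat, sql, conn.execute(sql).fetchall()) for cat, sql in queries]


# Toy table standing in for proprietary data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("pen", 3), ("ink", 1), ("pen", 2)]
)
for category, sql, expected in generate_checklist(conn, "sales"):
    print(category, sql, expected)
```

In a setup like this, the SQL string could serve as the target for a semantic parsing model and the executed result as the reference answer for a question answering model, which is one reason generating tests from queries makes them reusable across models.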


DOFEN: Deep Oblivious Forest ENsemble

Neural Information Processing Systems

Deep Neural Networks (DNNs) have revolutionized artificial intelligence, achieving impressive results on diverse data types, including images, videos, and texts.


Transferable Adversarial Robustness for Categorical Data via Universal Robust Embeddings

Neural Information Processing Systems

Research on adversarial robustness is primarily focused on image and text data. Yet many scenarios in which a lack of robustness can result in serious risks, such as fraud detection, medical diagnosis, or recommender systems, often rely not on images or text but on tabular data. Adversarial robustness in tabular data poses two serious challenges. First, tabular datasets often contain categorical features, and therefore cannot be tackled directly with existing optimization procedures. Second, in the tabular domain, algorithms that are not based on deep networks are widely used and offer great performance, but algorithms to enhance robustness are tailored to neural networks (e.g.



22456f4b545572855c766df5eefc9832-Supplemental.pdf

Neural Information Processing Systems

We use t-SNE [37] to project each real/fake record onto a 2-dim space. We summarize the statistics of our datasets as follows: 1. Adult has 22K training, 10K testing records with 6 continuous numerical, 8 categorical, and 1 discrete numerical columns. News has 32K training records, 8K testing records with 45 continuous numerical, 14 categorical, and 0 discrete numerical columns. We introduce one more visualization with Credit in Figure 4. IT-GAN(Q) shows the best similarity between the real and fake points. We compare our method with the following baseline methods, including state-of-the-art VAEs and GANs for tabular data synthesis and our IT-GAN's three variations: 1. Ind is a heuristic method in which we independently sample a value from each column's ground-truth distribution. We use the hyperparameters recommended for these baselines in their original papers and/or GitHub repositories.
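The independent-sampling baseline described above can be sketched in a few lines. This is an illustrative reading, not the authors' code: it assumes each column's ground-truth distribution is approximated by the values observed in the real records, and the function name `sample_independent` and the toy records are hypothetical.

```python
import random


def sample_independent(rows, n_samples, seed=0):
    """Draw synthetic records by sampling every column independently (hypothetical helper)."""
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    # Observed values per column; a uniform draw from this list follows the
    # column's empirical marginal distribution (our stand-in for "ground truth").
    observed = {c: [r[c] for r in rows] for c in columns}
    # Each column is drawn separately, so cross-column correlations are discarded.
    return [{c: rng.choice(observed[c]) for c in columns} for _ in range(n_samples)]


# Toy records standing in for a real tabular dataset.
real = [
    {"age": 25, "job": "clerk"},
    {"age": 40, "job": "nurse"},
    {"age": 31, "job": "clerk"},
]
fake = sample_independent(real, n_samples=5, seed=1)
print(fake)
```

Because every column is sampled on its own, such a baseline preserves per-column marginals but not the relationships between columns, which is what makes it a useful lower bar for learned generators.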


TransTab: Learning Transferable Tabular Transformers Across Tables

Neural Information Processing Systems

Tabular data (or tables) are the most widely used data format in machine learning (ML). However, ML models often assume that the table structure stays fixed between training and testing. Before ML modeling, heavy data cleaning is required to merge disparate tables with different columns.


Synthcity: a benchmark framework for diverse use cases of tabular synthetic data

Neural Information Processing Systems

Accessible high-quality data is the bread and butter of machine learning research, and the demand for data has exploded as larger and more advanced ML models are built across different domains. Yet, real data often contain sensitive information, are subject to various biases, and are costly to acquire, which compromise their quality and accessibility. Synthetic data have thus emerged as a complement, sometimes even a replacement, to real data for ML training. However, the landscape of synthetic data research has been fragmented due to the large number of data modalities (e.g., tabular data, time series data, images, etc.) and various use cases (e.g., privacy, fairness, data augmentation, etc.). This poses practical challenges in comparing and selecting synthetic data generators in different problem settings. To this end, we develop Synthcity, an open-source Python library that allows researchers and practitioners to perform one-click benchmarking of synthetic data generators across data modalities and use cases. In addition, Synthcity's plug-in style API makes it easy to incorporate additional data generators into the framework. Beyond benchmarking, it also offers a single access point to a diverse range of cutting-edge data generators. Through examples on tabular data generation and data augmentation, we illustrate the general applicability of Synthcity, and the insight one can obtain.